Abstract


We study the problem of black-box optimization of a function $f$ of any dimension, given function
evaluations perturbed by noise. The function is assumed to be locally smooth around one of its
global optima, but this smoothness is unknown. Our contribution is an adaptive optimization
algorithm, POO or parallel optimistic optimization, that is able to deal with this setting. POO
performs almost as well as the best known algorithms requiring the knowledge of the smoothness.
Furthermore, POO works for a larger class of functions than what was previously considered,
especially for functions that are difficult to optimize, in a very precise sense. We provide a
finite-time analysis of POO's performance, which shows that its error after $n$ evaluations is at
most a factor of $\sqrt{\ln n}$ away from the error of the best known optimization algorithms using
the knowledge of the smoothness.


1 Introduction 


We treat the problem of optimizing a function $f : \mathcal{X} \to \mathbb{R}$ given a finite budget
of $n$ noisy evaluations. We consider that the cost of any of these function evaluations is high.
That means we care about assessing the optimization performance in terms of the sample complexity,
i.e., the number $n$ of function evaluations. This is typically the case when one needs to tune
parameters for a complex system seen as a black box, whose performance can only be evaluated by a
costly simulation. One such example is hyper-parameter tuning, where the sensitivity to
perturbations is large and the derivatives of the objective function with respect to these
parameters do not exist or are unknown.


Such a setting fits the sequential decision-making setting under bandit feedback. In this setting,
the actions are the points that lie in a domain $\mathcal{X}$. At each step $t$, an algorithm
selects an action $x_t \in \mathcal{X}$ and receives a reward $r_t$, which is a noisy function
evaluation such that $r_t = f(x_t) + \varepsilon_t$, where $\varepsilon_t$ is a bounded noise with
$\mathbb{E}[\varepsilon_t \mid x_t] = 0$. After $n$ evaluations, the algorithm outputs its best
guess $x(n)$, which can be different from $x_n$. The performance measure we want to minimize is the
value of the function at the returned point compared to the optimum, also referred to as simple
regret,

$$R_n \triangleq \sup_{x \in \mathcal{X}} f(x) - f(x(n)).$$

We assume there exists at least one point $x^\star \in \mathcal{X}$ such that
$f(x^\star) = \sup_{x \in \mathcal{X}} f(x)$.
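
The feedback model and the simple regret can be illustrated with a minimal Python sketch. The objective `f`, the noise range, and the helper names are assumptions chosen for illustration only; they do not come from the paper.

```python
import random

def f(x):
    # Hypothetical objective on [0, 1] with maximum f(x*) = 1 at x* = 1/3.
    return 1.0 - abs(x - 1.0 / 3.0)

def noisy_eval(x, rng, noise=0.1):
    # r_t = f(x_t) + eps_t with bounded, zero-mean noise eps_t in [-noise, noise].
    return f(x) + rng.uniform(-noise, noise)

def simple_regret(x_returned, f_star=1.0):
    # R_n = sup_x f(x) - f(x(n)); uses the noiseless value at the returned point.
    return f_star - f(x_returned)

rng = random.Random(0)
rewards = [noisy_eval(0.5, rng) for _ in range(1000)]
mean_reward = sum(rewards) / len(rewards)  # concentrates around f(0.5)
```

Note that the simple regret is defined through the noiseless $f$, so an algorithm can only estimate it from noisy rewards such as `mean_reward`.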


The relationship with bandit settings motivated UCT [10, 8], an empirically successful heuristic
that hierarchically partitions the domain $\mathcal{X}$ and selects the next point
$x_t \in \mathcal{X}$ using upper confidence bounds [1]. The empirical success of UCT on one side
but the absence of performance guarantees for it on the other incited research on similar but
theoretically founded algorithms [4, 9, 12, 2, 6].


As the global optimization of an unknown function without any assumptions would be a daunting
needle-in-a-haystack problem, most of the algorithms make at least the very weak assumption that
the function does not decrease faster than a known rate around one of its global optima. In other
words, they assume a certain local smoothness property of $f$. This smoothness is often expressed
in the form of a semi-metric $\ell$ that quantifies this regularity [4]. Naturally, this regularity
also influences the guarantees that these algorithms are able to furnish. Many of them define a
near-optimality dimension $d$ or a zooming dimension. These are $\ell$-dependent quantities used to
bound the simple regret $R_n$ or a related notion called cumulative regret.


Our work focuses on a notion of such near-optimality dimension $d$ that does not directly relate
the smoothness property of $f$ to a specific metric $\ell$ but directly to the hierarchical
partitioning $\mathcal{P} = \{P_{h,i}\}$, a tree-based representation of the space used by the
algorithm. Indeed, an interesting fundamental question is to determine a good characterization of
the difficulty of the optimization for an algorithm that uses a given hierarchical partitioning of
the space $\mathcal{X}$ as its input. The kind of hierarchical partitioning $\{P_{h,i}\}$ we
consider is similar to the ones introduced in prior work: for any depth $h \ge 0$ in the tree
representation, the set of cells $\{P_{h,i}\}_{1 \le i \le I_h}$ forms a partition of
$\mathcal{X}$, where $I_h$ is the number of cells at depth $h$. At depth 0, the root of the tree,
there is a single cell $P_{0,1} = \mathcal{X}$. A cell $P_{h,i}$ of depth $h$ is split into several
children subcells $\{P_{h+1,j}\}_j$ of depth $h + 1$. We call a partitioning standard when each
cell is split into regular same-sized subcells [13].
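
As a concrete illustration, here is a small sketch of a standard $K$-ary partitioning of $[0, 1)$. The function names and the 1-based cell indexing are assumptions of this sketch, not notation fixed by the paper.

```python
def cell_bounds(h, i, K=2):
    """Interval [a, b) of cell (h, i), with i = 1..K**h, in a standard K-ary partitioning of [0, 1)."""
    width = float(K) ** (-h)
    return (i - 1) * width, i * width

def children(h, i, K=2):
    """Indices (h + 1, j) of the K same-sized subcells of cell (h, i)."""
    return [(h + 1, K * (i - 1) + k) for k in range(1, K + 1)]

# The children of a cell tile it exactly: here cell (1, 2) = [0.5, 1).
parent = cell_bounds(1, 2)
kids = [cell_bounds(hh, jj) for (hh, jj) in children(1, 2)]
```

At depth $h$ this construction yields $K^h$ cells, each of width $K^{-h}$.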


An important insight, detailed in Section 2, is that a near-optimality dimension d that is independent 
from the partitioning used by an algorithm (as defined in prior work [4, 9, 2]) does not embody the 
optimization difficulty perfectly. This is easy to see, as for any $f$ we could define a
partitioning perfectly suited for $f$. An example is a partitioning that at the root splits
$\mathcal{X}$ into $\{x^\star\}$ and $\mathcal{X} \setminus \{x^\star\}$, which makes the
optimization trivial, whatever $d$ is. This insight was already observed by Slivkins [14] and
Bull [6], whose zooming dimension depends both on the function and the partitioning.


In this paper, we define a notion of near-optimality dimension $d$ which measures the complexity of
the optimization problem directly in terms of the partitioning used by an algorithm. First, we make
the following local smoothness assumption about the function, expressed in terms of the
partitioning and not any metric: for a given partitioning $\mathcal{P}$, we assume that there exist
$\nu > 0$ and $\rho \in (0, 1)$ such that

$$\forall h \ge 0,\ \forall x \in P_{h,i_h^\star}, \quad f(x) \ge f(x^\star) - \nu\rho^h,$$

where $(h, i_h^\star)$ is the (unique) cell of depth $h$ containing $x^\star$. Then, we define the
near-optimality dimension $d(\nu, \rho)$ as

$$d(\nu, \rho) \triangleq \inf\left\{d' \in \mathbb{R}^+ : \exists C > 0,\ \forall h \ge 0,\ \mathcal{N}_h(2\nu\rho^h) \le C\rho^{-d'h}\right\},$$

where for all $\varepsilon > 0$, $\mathcal{N}_h(\varepsilon)$ is the number of cells $P_{h,i}$ of
depth $h$ such that $\sup_{x \in P_{h,i}} f(x) \ge f(x^\star) - \varepsilon$. Intuitively,
functions with smaller $d$ are easier to optimize and we denote $(\nu, \rho)$, for which
$d(\nu, \rho)$ is the smallest, as $(\nu_\star, \rho_\star)$. Obviously, $d(\nu, \rho)$ depends on
$\mathcal{P}$ and $f$, but does not depend on any choice of a specific metric. In Section 2, we
argue that this definition of $d$¹ encompasses the optimization complexity better. We stress this
is not an artifact of our analysis and previous algorithms, such as HOO [4], TaxonomyZoom [14], or
HCT [2], can be shown to scale with this new notion of $d$.


Most of the prior bandit-based algorithms proposed for function optimization, for either determinis- 
tic or stochastic setting, assume that the smoothness of the optimized function is known. This is the 
case of known semi-metric [4, 2] and pseudo-metric [9]. This assumption limits the application of 
these algorithms and opened a very compelling question of whether this knowledge is necessary. 


Prior work responded with algorithms not requiring this knowledge. Bubeck et al. [5] provided an
algorithm for optimization of Lipschitz functions without the knowledge of the Lipschitz constant.
However, they have to assume that $f$ is twice differentiable and a bound on the second order
derivative is known. Combes and Proutière [7] treat unimodal $f$ restricted to dimension one.
Slivkins [14] considered a general optimization problem embedded in a taxonomy² and provided
guarantees as a function of the quality of the taxonomy. The quality refers to the probability of
reaching two cells belonging to the same branch that can have values that differ by more than half
of the diameter (expressed by the true metric) of the branch. The problem is that the algorithm
needs a lower bound on this quality (which can be tiny) and the performance depends inversely on
this quantity. Also it assumes that the quality is strictly positive. In this paper, we do not rely
on the knowledge of quality and also consider a more general class of functions for which the
quality can be 0 (Appendix E).





¹ We use the simplified notation $d$ instead of $d(\nu, \rho)$ for clarity when no confusion is possible.
² which is similar to the hierarchical partitioning previously defined

Figure 1: Difficult function
$f : x \mapsto s(\log_2 |x - 0.5|) \cdot (\sqrt{|x - 0.5|} - (x - 0.5)^2) - \sqrt{|x - 0.5|}$,
where $s(x) = 1$ if the fractional part of $x$, that is, $x - \lfloor x \rfloor$, is in $[0, 0.5]$
and $s(x) = 0$ if it is in $(0.5, 1)$. Left: Oscillation between two envelopes of different
smoothness leading to a nonzero $d$ for a standard partitioning. Right: Simple regret of HOO after
5000 evaluations for different values of $\rho$.
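
The function of Figure 1 can be written down directly. The sketch below assumes the reading $f(x) = s(\log_2|x - 0.5|)\,(\sqrt{|x - 0.5|} - (x - 0.5)^2) - \sqrt{|x - 0.5|}$ and, as a further assumption, sets $f(0.5) = 0$ at the maximizer, where the logarithm is undefined. The name `difficult_f` is hypothetical.

```python
import math

def s(x):
    # 1 if the fractional part of x lies in [0, 0.5], else 0.
    frac = x - math.floor(x)
    return 1.0 if frac <= 0.5 else 0.0

def difficult_f(x):
    u = abs(x - 0.5)
    if u == 0.0:
        return 0.0  # assumed value at the maximizer x* = 0.5
    # s = 1 gives the upper envelope -u**2, s = 0 the lower envelope -sqrt(u).
    return s(math.log2(u)) * (math.sqrt(u) - u ** 2) - math.sqrt(u)

# The function oscillates between its two envelopes as x approaches 0.5.
grid = [k / 1000.0 for k in range(1001)]
inside = all(
    -math.sqrt(abs(x - 0.5)) - 1e-12 <= difficult_f(x) <= -(x - 0.5) ** 2 + 1e-12
    for x in grid
)
```

The switch between envelopes happens on a logarithmic scale in $|x - 0.5|$, so both regimes keep recurring at every depth of a standard partitioning.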


Another direction has been followed by Munos [11], where in the deterministic case (the function
evaluations are not perturbed by noise), their SOO algorithm performs almost as well as the best
known algorithms without the knowledge of the function smoothness. SOO was later extended to
StoSOO [15] for the stochastic case. However, StoSOO only extends SOO for a limited case of easy
instances of functions for which there exists a semi-metric under which $d = 0$. Also, Bull [6]
provided a similar simple regret bound for ATB for a class of functions, called zooming continuous
functions, which is related to the class of functions for which there exists a semi-metric under
which the near-optimality dimension is $d = 0$. But none of the prior work considers a more general
class of functions where there is no semi-metric adapted to the standard partitioning for which
$d = 0$.


To give an example of a difficult function, consider the function in Figure 1. It possesses a lower
and upper envelope around its global optimum that are equivalent to $x^2$ and $\sqrt{x}$ and
therefore have different smoothness. Thus, for a standard partitioning, there is no semi-metric of
the form $\ell(x, y) = \|x - y\|^\alpha$ for which the near-optimality dimension is $d = 0$, as
shown by Valko et al. [15]. Other examples of nonzero near-optimality dimension are the functions
that for a standard partitioning behave differently depending on the direction, for instance
$f : (x, y) \mapsto 1 - |x| - y^2$.


Using a bad value for the $\rho$ parameter can have dramatic consequences on the simple regret. In
Figure 1, we show the simple regret after 5000 function evaluations for different values of $\rho$.
For the values of $\rho$ that are too low, the algorithm does not explore enough and is stuck in a
local maximum, while for values of $\rho$ too high the algorithm wastes evaluations by exploring
too much.


In this paper, we provide a new algorithm, POO, parallel optimistic optimization, which competes
with the best algorithms that assume the knowledge of the function smoothness, for a larger class
of functions than was previously done. Indeed, POO handles a panoply of functions, including hard
instances, i.e., such that $d > 0$, like the function illustrated above. We also recover the result
of StoSOO and ATB for functions with $d = 0$. In particular, we bound POO's simple regret as

$$\mathbb{E}[R_n] \le O\left(\left((\ln^2 n)/n\right)^{1/(d(\nu_\star, \rho_\star)+2)}\right).$$

This result should be compared to the simple regret of the best known algorithm that uses the
knowledge of the metric under which the function is smooth, or equivalently $(\nu, \rho)$, which is
of the order of $O((\ln n/n)^{1/(d+2)})$. Thus POO's performance is at most a factor of
$(\ln n)^{1/(d+2)}$ away from that of the best known optimization algorithms that require the
knowledge of the function smoothness. Interestingly, this factor decreases with the complexity
measure $d$: the harder the function to optimize, the less important it is to know its precise
smoothness.
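
To make the comparison of the two rates concrete, the following sketch evaluates both expressions with all constants set to one (an assumption, since the $O(\cdot)$ notation hides them) and computes their ratio. The function names are hypothetical.

```python
import math

def poo_rate(n, d):
    # ((ln^2 n) / n)^(1 / (d + 2)), constants dropped.
    return (math.log(n) ** 2 / n) ** (1.0 / (d + 2))

def known_smoothness_rate(n, d):
    # ((ln n) / n)^(1 / (d + 2)), constants dropped.
    return (math.log(n) / n) ** (1.0 / (d + 2))

def gap_factor(n, d):
    # Ratio of the two rates, which simplifies to (ln n)^(1 / (d + 2)).
    return poo_rate(n, d) / known_smoothness_rate(n, d)

n = 10 ** 4
factors = {d: gap_factor(n, d) for d in (0, 1, 3)}
```

For $d = 0$ the ratio is $\sqrt{\ln n}$, and it shrinks toward 1 as $d$ grows.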


2 Background and assumptions 


2.1 Hierarchical optimistic optimization 


POO optimizes functions without the knowledge of their smoothness using a subroutine, an anytime 
algorithm optimizing functions using the knowledge of their smoothness. In this paper, we use a 
modified version of HOO [4] as such a subroutine. Therefore, we begin with a quick review of HOO.


HOO follows an optimistic strategy close to UCT [10], but unlike UCT, it uses proper confidence
bounds to provide theoretical guarantees. HOO refines a partition of the space based on a
hierarchical partitioning, where at each step, a yet unexplored cell (a leaf of the corresponding
tree) is selected, and the function is evaluated at a point within this cell. The selected path
(from the root to the leaf) is the one that maximizes the minimum value $U_{h,i}(t)$ among all
cells of each depth, where the value $U_{h,i}(t)$ of any cell $P_{h,i}$ is defined as

$$U_{h,i}(t) = \hat{\mu}_{h,i}(t) + \sqrt{\frac{2\ln(t)}{N_{h,i}(t)}} + \nu\rho^h,$$

where $t$ is the number of evaluations done so far, $\hat{\mu}_{h,i}(t)$ is the empirical average
of all evaluations done within $P_{h,i}$, and $N_{h,i}(t)$ is the number of them. The second term
in the definition of $U_{h,i}(t)$ is a Chernoff-Hoeffding type confidence interval, measuring the
estimation error induced by the noise. The third term, $\nu\rho^h$ with $\rho \in (0, 1)$, is, by
assumption, a bound on the difference $f(x^\star) - f(x)$ for any $x \in P_{h,i_h^\star}$, a cell
containing $x^\star$. It is in this bound that HOO relies on the knowledge of the smoothness,
because the algorithm requires the values of $\nu$ and $\rho$. In the next sections, we clarify the
assumptions made by HOO vs. related algorithms and point out the differences with POO.
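
The three terms of the U-value translate directly into code. This is a minimal sketch: the function name and argument order are assumptions, and treating an unvisited cell as infinitely optimistic is the usual convention rather than something stated above.

```python
import math

def u_value(mean, count, t, h, nu, rho):
    """Optimistic value of a cell at depth h after t evaluations overall.

    mean  : empirical average of the rewards observed in the cell
    count : number of evaluations made in the cell
    """
    if count == 0:
        return math.inf  # assumption: unvisited cells are maximally optimistic
    confidence = math.sqrt(2.0 * math.log(t) / count)  # Chernoff-Hoeffding term
    smoothness = nu * rho ** h                          # bound on the drop of f in the cell
    return mean + confidence + smoothness
```

Both bonus terms shrink as a cell is sampled more often and as the depth grows, which is what lets the search concentrate on a few deep cells.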


2.2 Assumptions made in prior work 


Most of the previous work relies on the knowledge of a semi-metric on $\mathcal{X}$ such that the
function is either locally smooth near one of its maxima with respect to this metric [11, 15, 2] or
satisfies a stronger, weakly-Lipschitz assumption [4, 12, 2]. Furthermore, Kleinberg et al. [9]
assume the full metric. Note that a semi-metric does not require the triangle inequality to hold.
For instance, consider the semi-metric $\ell(x, y) = \|x - y\|^\alpha$ on $\mathbb{R}^p$ with
$\|\cdot\|$ being the Euclidean norm. When $\alpha > 1$, this semi-metric does not satisfy the
triangle inequality: with $\alpha = 2$ on $\mathbb{R}$, $\ell(0, 2) = 4 > \ell(0, 1) + \ell(1, 2) = 2$.
However, it is a metric for $0 < \alpha \le 1$. Therefore, using only a semi-metric allows us to
consider a larger class of functions.


Prior work typically requires two assumptions. The first one is on the semi-metric $\ell$ and the
function. An example is the weakly-Lipschitz assumption needed by Bubeck et al. [4], which requires
that

$$\forall x, y \in \mathcal{X}, \quad f(x^\star) - f(y) \le f(x^\star) - f(x) + \max\{f(x^\star) - f(x),\ \ell(x, y)\}.$$

It is a weak version of a Lipschitz condition, restricting $f$ in particular for the values close
to $f(x^\star)$. More recent results [11, 15, 2] assume only a local smoothness around one of the
function maxima,

$$\forall x \in \mathcal{X}, \quad f(x^\star) - f(x) \le \ell(x^\star, x).$$


The second common assumption links the hierarchical partitioning with the semi-metric. It requires
the partitioning to be adapted to the (semi-)metric. More precisely, the well-shaped assumption
states that there exist $\rho < 1$ and $\nu_1 \ge \nu_2 > 0$ such that for any depth $h \ge 0$ and
index $i = 1, \dots, I_h$, the subset $P_{h,i}$ is contained in and contains two open balls of
radius $\nu_1\rho^h$ and $\nu_2\rho^h$, respectively, where the balls are w.r.t. the same
semi-metric used in the definition of the function smoothness.


'Local smoothness' is weaker than 'weakly Lipschitz' and therefore preferable. Algorithms requiring
the local-smoothness assumption always sample a cell $P_{h,i}$ at a special representative point
and, in the stochastic case, collect several function evaluations from the same point before
splitting the cell. This is not the case for HOO, which allows sampling any point inside the
selected cell and expanding each cell after one sample. This additional flexibility comes at the
price of requiring the stronger weakly-Lipschitz assumption. Nevertheless, although HOO does not
wait before expanding a cell, it does something similar by selecting a path from the root to this
leaf that maximizes the minimum of the U-value over the cells of the path, as mentioned in
Section 2.1. The fact that HOO follows an optimistic strategy even after reaching the cell that
possesses the minimal U-value along the path is not used in the analysis of the HOO algorithm.


Furthermore, a reason for better dependency on the smoothness in other algorithms, e.g., HCT [2], 
is not only algorithmic: HCT needs to assume a slightly stronger condition on the cell, i.e., that the 
single center of the two balls (one that covers and the other one that contains the cell) is actually the 
same point that HCT uses for sampling. This is stronger than just assuming that there simply exist 
such centers of the two balls, which are not necessarily the same points where we sample (which is 
the HOO assumption). Therefore, this is in contrast with HOO that samples any point from the cell. In 
fact, it is straightforward to modify HOO to only sample at a representative point in each cell and only 
require the local-smoothness assumption. In our analysis and the algorithm, we use this modified 
version of HOO, thereby profiting from this weaker assumption. 


Prior work [9, 4, 11, 2, 12] often defined some 'dimension' $d$ of the near-optimal space of $f$
measured according to the (semi-)metric $\ell$. For example, the so-called near-optimality
dimension [4] measures the size of the near-optimal space
$\mathcal{X}_\varepsilon = \{x \in \mathcal{X} : f(x) > f(x^\star) - \varepsilon\}$ in terms of
packing numbers: for any $c > 0$, $\varepsilon_0 > 0$, the $(c, \varepsilon_0)$-near-optimality
dimension $d$ of $f$ with respect to $\ell$ is defined as

$$\inf\left\{d \in [0, \infty) : \exists C \text{ s.t. } \forall \varepsilon \le \varepsilon_0,\ \mathcal{N}(\mathcal{X}_{c\varepsilon}, \ell, \varepsilon) \le C\varepsilon^{-d}\right\}, \tag{1}$$

where for any subset $A \subseteq \mathcal{X}$, the packing number $\mathcal{N}(A, \ell, \varepsilon)$
is the maximum number of disjoint balls of radius $\varepsilon$ contained in $A$.


2.3 Our assumption 


Contrary to the previous approaches, we need only a single assumption. We do not introduce any
(semi-)metric and instead directly relate $f$ to the hierarchical partitioning $\mathcal{P}$,
defined in Section 1. Let $K$ be the maximum number of children cells
$(P_{h+1,j_k})_{1 \le k \le K}$ per cell $P_{h,i}$. We remind the reader that given a global
maximum $x^\star$ of $f$, $i_h^\star$ denotes the index of the unique cell of depth $h$ containing
$x^\star$, i.e., such that $x^\star \in P_{h,i_h^\star}$. With this notation we can state our sole
assumption on both the partitioning $(P_{h,i})$ and the function $f$.


Assumption 1. There exist $\nu > 0$ and $\rho \in (0, 1)$ such that

$$\forall h \ge 0,\ \forall x \in P_{h,i_h^\star}, \quad f(x) \ge f(x^\star) - \nu\rho^h.$$


The values $(\nu, \rho)$ define a lower bound on the possible drop of $f$ near the optimum
$x^\star$ according to the partitioning. The choice of the exponential rate $\nu\rho^h$ is made to
cover a very large class of functions, as well as to relate to results from prior work. In
particular, for a standard partitioning on $\mathbb{R}^p$ and any $\alpha, \beta > 0$, any function
$f$ such that $f(x) \sim_{x \to x^\star} \beta\|x - x^\star\|^\alpha$ fits this assumption. This is
also the case for more complicated functions such as the one illustrated in Figure 1. An example of
a function and a partitioning that does not satisfy this assumption is the function
$f : x \mapsto 1/\ln x$ and a standard partitioning of $[0, 1)$ because the function decreases too
fast around $x^\star = 0$. As observed by Valko et al. [15], this assumption can be weakened to
hold only for values of $f$ that are $\eta$-close to $f(x^\star)$ up to an $\eta$-dependent
constant in the simple regret.


Let us note that the set of assumptions made by prior work (Section 2.2) can be reformulated using
solely Assumption 1. For example, for any $f(x) \sim_{x \to x^\star} \beta\|x - x^\star\|^\alpha$,
one could consider the semi-metric $\ell(x, y) = \beta\|x - y\|^\alpha$ for which the corresponding
near-optimality dimension defined by Equation 1 for a standard partitioning is $d = 0$. Yet we
argue that our setting provides a more natural way to describe the complexity of the optimization
problem for a given hierarchical partitioning.





Indeed, existing algorithms that use a hierarchical partitioning of $\mathcal{X}$, like HOO, do not
use the full metric information but instead only use the values $\nu$ and $\rho$, paired up with
the partitioning. Hence, the precise value of the metric impacts neither the algorithms' decisions
nor their performance. What really matters is how the hierarchical partitioning of $\mathcal{X}$
fits $f$. Indeed, this fit is what we measure. To reinforce this argument, notice again that any
function can be trivially optimized given a perfectly adapted partitioning, for instance the one
that associates $x^\star$ to one child of the root.


Also, the previous analyses tried to provide performance guarantees based only on the metric and
$f$. However, since the metric is assumed to be such that the cells of the partitioning are well
shaped, the large diversity of possible metrics vanishes. Choosing such a metric then comes down to
choosing only $\nu$, $\rho$, and a hierarchical decomposition of $\mathcal{X}$. Another way of
seeing this is to remark that previous works make an assumption on both the function and the
metric, and another on both the metric and the partitioning. We underline that the metric is
actually there just to create a link between the function and the partitioning. By discarding the
metric, we merge the two assumptions into a single one and convert a topological problem into a
combinatorial one, leading to easier analysis.


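To proceed, we define a new near-optimality dimension. For any $\nu > 0$ and $\rho \in (0, 1)$, the
near-optimality dimension $d(\nu, \rho)$ of $f$ with respect to the partitioning $\mathcal{P}$ is
defined as follows.


Definition 1. Near-optimality dimension of $f$ is

$$d(\nu, \rho) \triangleq \inf\left\{d' \in \mathbb{R}^+ : \exists C > 0,\ \forall h \ge 0,\ \mathcal{N}_h(2\nu\rho^h) \le C\rho^{-d'h}\right\},$$

where $\mathcal{N}_h(\varepsilon)$ is the number of cells $P_{h,i}$ of depth $h$ such that
$\sup_{x \in P_{h,i}} f(x) \ge f(x^\star) - \varepsilon$.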


The hierarchical decomposition of the space $\mathcal{X}$ is the only prior information available
to the algorithm. The (new) near-optimality dimension is a measure of how well this partitioning is
adapted to $f$. More precisely, it is a measure of the size of the near-optimal set, i.e., the
cells which are such that $\sup_{x \in P_{h,i}} f(x) \ge f(x^\star) - \varepsilon$. Intuitively,
this corresponds to the set of cells that any algorithm would have to sample in order to discover
the optimum.

As an example, any $f$ such that $f(x) \sim_{x \to x^\star} \|x - x^\star\|^\alpha$, for any
$\alpha > 0$, has a zero near-optimality dimension with respect to the standard partitioning and an
appropriate choice of $\rho$. As discussed by Valko et al. [15], any function such that the upper
and lower envelopes of $f$ near its maximum are of the same order has a near-optimality dimension
of zero for a standard partitioning of $[0, 1]$. An example of a function with $d > 0$ for the
standard partitioning is in Figure 1. Functions that behave differently in different dimensions
also have $d > 0$ for the standard partitioning. Nonetheless, for some handcrafted partitioning, it
is possible to have $d = 0$ even for those troublesome functions.
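
The count of near-optimal cells behind the near-optimality dimension can be checked numerically on a toy case. The sketch below assumes the hypothetical objective $f(x) = 1 - |x - x^\star|$ with $x^\star = 1/3$, a standard binary partitioning of $[0, 1]$, and $(\nu, \rho) = (1, 1/2)$, for which the local smoothness assumption holds because a depth-$h$ cell has width $2^{-h}$.

```python
def near_optimal_cells(h, x_star=1.0 / 3.0, nu=1.0, rho=0.5, K=2):
    """Number of depth-h cells whose supremum of f is within 2 * nu * rho**h of f(x*)."""
    width = float(K) ** (-h)
    eps = 2.0 * nu * rho ** h
    count = 0
    for i in range(K ** h):
        a, b = i * width, (i + 1) * width
        # For f(x) = 1 - |x - x*|, sup of f over [a, b] is f(x*) - dist(x*, [a, b]).
        dist = max(a - x_star, x_star - b, 0.0)
        if dist <= eps:
            count += 1
    return count

counts = [near_optimal_cells(h) for h in range(8)]
```

Here the count stays bounded by a constant as $h$ grows, so the bound $C\rho^{-d'h}$ already holds with $d' = 0$ and the near-optimality dimension is zero for this function and partitioning.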


Under our new assumption and our new definition of near-optimality dimension, one can prove the 
same regret bound for HOO as Bubeck et al. [4] and the same can be done for other related algorithms. 


3 The POO algorithm 


3.1 Description of POO 


The POO algorithm uses, as a subroutine, an optimizing algorithm that requires the knowledge of 
the function smoothness. We use HOO [4] as the base algorithm, but other algorithms, such as 
HCT [2], could be used as well. POO, with pseudocode in Algorithm 1, runs several HOO instances
in parallel, hence the name parallel optimistic optimization. The number of base HOO instances and 
other parameters are adapted to the budget of evaluations and are automatically decided on the fly. 


Each instance of HOO requires two real numbers $\nu$ and $\rho$. Running HOO parametrized with
$(\rho, \nu)$ that are far from the optimal one $(\nu_\star, \rho_\star)$³ would cause HOO to
underperform. Surprisingly, our analysis of this suboptimality gap reveals that it does not
decrease too fast as we stray away from $(\nu_\star, \rho_\star)$. This motivates the following
observation. If we simultaneously run a slew of HOOs with different $(\nu, \rho)$s, one of them is
going to perform decently well.


In fact, we show that to achieve good performance, we only require $(\ln n)$ HOO instances, where
$n$ is the current number of function evaluations. Notice that we do not require to know the total
number of rounds in advance, which hints that we can hope for a naturally anytime algorithm.


The strategy of POO is quite simple: It consists of running $N$ instances of HOO in parallel, that
are all launched with different $(\nu, \rho)$s. At the end of the whole process, POO selects the
instance $s^\star$ which performed the best and returns one of the points selected by this
instance, chosen uniformly at random. Note that just using a doubling trick in HOO with increasing
values of $\rho$ and $\nu$ is not enough to guarantee a good performance. Indeed, it is important
to keep track of all HOO instances. Otherwise, the regret rate would suffer way too much from using
the value of $\rho$ that is too far from the optimal one.


Algorithm 1 POO

Parameters: $K$, $\mathcal{P} = \{P_{h,i}\}$
Optional parameters: $\rho_{\max}$, $\nu_{\max}$
Initialization:
    $D_{\max} \leftarrow \ln K / \ln(1/\rho_{\max})$
    $n \leftarrow 0$ {number of evaluations performed}
    $N \leftarrow 1$ {number of HOO instances}
    $S \leftarrow \{(\nu_{\max}, \rho_{\max})\}$ {set of HOO instances}
while computational budget is available do
    while $N < \tfrac{1}{2} D_{\max} \ln(n/\ln n)$ do
        for $i \leftarrow 1, \dots, N$ do {start new HOOs}
            $s \leftarrow (\nu_{\max}, \rho_{\max}^{2N/(2i+1)})$
            $S \leftarrow S \cup \{s\}$
            Perform $n/N$ function evaluations with HOO($s$)
            Update the average reward $\hat{\mu}[s]$ of HOO($s$)
        end for
        $n \leftarrow 2n$
        $N \leftarrow 2N$
    end while {ensure there are enough HOOs}
    for $s \in S$ do
        Perform a function evaluation with HOO($s$)
        Update the average reward $\hat{\mu}[s]$ of HOO($s$)
    end for
    $n \leftarrow n + N$
end while
$s^\star \leftarrow \arg\max_{s \in S} \hat{\mu}[s]$
Output: A random point evaluated by HOO($s^\star$)
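
The overall strategy can be sketched in Python under strong simplifications. The sketch below is not the anytime algorithm: it assumes a fixed budget and a fixed number of instances, uses a binary partitioning of $[0, 1]$ with each cell evaluated at its midpoint, shares no information between instances, and takes $\rho_i = \rho_{\max}^{N/i}$ as the parameter grid. The class `HOO`, the function `poo_fixed_budget`, the noise level, and the test objective are all hypothetical.

```python
import math
import random

class HOO:
    """Simplified HOO on [0, 1] with binary cells, each evaluated at its midpoint."""

    def __init__(self, f, nu, rho, rng, noise=0.1):
        self.f, self.nu, self.rho, self.rng, self.noise = f, nu, rho, rng, noise
        self.count = {}  # (h, i) -> number of evaluations in the cell
        self.total = {}  # (h, i) -> sum of rewards observed in the cell
        self.points, self.rewards = [], []

    def _b_value(self, h, i):
        if (h, i) not in self.count:
            return math.inf  # unexplored cells are maximally optimistic
        n, t = self.count[(h, i)], len(self.rewards)
        u = self.total[(h, i)] / n + math.sqrt(2 * math.log(t) / n) + self.nu * self.rho ** h
        best_child = max(self._b_value(h + 1, 2 * i), self._b_value(h + 1, 2 * i + 1))
        return min(u, best_child)

    def step(self):
        # Follow the most optimistic child until an unexplored cell is reached.
        h, i, path = 0, 0, [(0, 0)]
        while (h, i) in self.count:
            left = self._b_value(h + 1, 2 * i)
            right = self._b_value(h + 1, 2 * i + 1)
            h, i = h + 1, (2 * i if left >= right else 2 * i + 1)
            path.append((h, i))
        x = (i + 0.5) / 2 ** h
        r = self.f(x) + self.rng.uniform(-self.noise, self.noise)
        for node in path:
            self.count[node] = self.count.get(node, 0) + 1
            self.total[node] = self.total.get(node, 0.0) + r
        self.points.append(x)
        self.rewards.append(r)

def poo_fixed_budget(f, budget, n_instances, rho_max=0.9, nu_max=1.0, seed=0):
    rng = random.Random(seed)
    rhos = [rho_max ** (n_instances / i) for i in range(1, n_instances + 1)]
    instances = [HOO(f, nu_max, rho, rng) for rho in rhos]
    for _ in range(budget // n_instances):
        for inst in instances:  # every instance gets the same number of evaluations
            inst.step()
    best = max(instances, key=lambda inst: sum(inst.rewards) / len(inst.rewards))
    return best.rho, rng.choice(best.points)

rho_best, x_out = poo_fixed_budget(lambda x: 1.0 - abs(x - 1.0 / 3.0), budget=400, n_instances=4)
```

The instance is chosen by its average observed reward, and the returned point is drawn uniformly from the points that instance evaluated, mirroring the selection rule described above.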





³ the parameters $(\nu, \rho)$ satisfying Assumption 1 for which $d(\nu, \rho)$ is the smallest


For clarity, the pseudo-code of Algorithm 1 takes $\rho_{\max}$ and $\nu_{\max}$ as parameters, but
in Appendix C we show how to set $\rho_{\max}$ and $\nu_{\max}$ automatically as functions of the
number of evaluations, i.e., $\rho_{\max}(n)$, $\nu_{\max}(n)$. Furthermore, in Appendix D, we
explain how to share information between the HOO instances, which makes the empirical performance
light-years better.


Since POO is anytime, the number of instances $N(n)$ is time-dependent and does not need to be
known in advance. In fact, $N(n)$ is increased alongside the execution of the algorithm. More
precisely, we want to ensure that

$$N(n) \ge \tfrac{1}{2} D_{\max} \ln(n/\ln n), \quad \text{where } D_{\max} \triangleq (\ln K)/\ln(1/\rho_{\max}).$$

To keep the set of different $(\nu, \rho)$s well distributed, the number of HOOs is not increased
one by one but instead is doubled when needed. Moreover, we also require that HOOs run in parallel
perform the same number of function evaluations. Consequently, when we start running new instances,
we first make sure to bring these instances on par with the already existing ones in terms of the
number of evaluations.


Finally, as our analysis reveals, a good choice of parameters $(\rho_i)$ is not a uniform grid on
$[0, 1]$. Instead, as suggested by our analysis, we require that $1/\ln(1/\rho_i)$ is a uniform
grid on $[0, 1/\ln(1/\rho_{\max})]$. As a consequence, we add HOO instances in batches such that
$\rho_i = \rho_{\max}^{N/i}$.
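
The instance schedule and the parameter grid can be sketched as follows. The sketch assumes the lower bound $\tfrac{1}{2} D_{\max} \ln(n/\ln n)$ on the number of instances with $D_{\max} = \ln K/\ln(1/\rho_{\max})$, a count that only doubles, and the grid $\rho_i = \rho_{\max}^{N/i}$, which makes $1/\ln(1/\rho_i)$ uniform. The function names are hypothetical and $n \ge 2$ is required for the logarithms to be defined.

```python
import math

def required_instances(n, K, rho_max):
    """Lower bound 0.5 * D_max * ln(n / ln n) on the number of HOO instances."""
    d_max = math.log(K) / math.log(1.0 / rho_max)
    return 0.5 * d_max * math.log(n / math.log(n))

def instance_count(n, K, rho_max):
    """Smallest power of two that satisfies the lower bound (doubling schedule)."""
    N = 1
    while N < required_instances(n, K, rho_max):
        N *= 2
    return N

def rho_grid(N, rho_max):
    """rho_i = rho_max ** (N / i) for i = 1..N."""
    return [rho_max ** (N / i) for i in range(1, N + 1)]

N = instance_count(5000, K=2, rho_max=0.9)
grid = rho_grid(4, 0.9)
inverse_logs = [1.0 / math.log(1.0 / r) for r in grid]  # evenly spaced by construction
```

Spacing the grid uniformly in $1/\ln(1/\rho)$ rather than in $\rho$ places more instances near $\rho_{\max}$, where small changes of $\rho$ matter most for the near-optimality dimension.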


3.2 Upper bound on POO's simple regret


POO does not require the knowledge of a $(\nu, \rho)$ verifying Assumption 1⁴ and yet we prove that
it achieves a performance close⁵ to the one obtained by HOO using the best parameters
$(\nu_\star, \rho_\star)$. This result solves the open question of Valko et al. [15], whether the
stochastic optimization of $f$ with unknown parameters $(\nu, \rho)$ when $d > 0$ for the standard
partitioning is possible.


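Theorem 1. Let $R_n$ be the simple regret of POO at step $n$. For any $(\nu, \rho)$ verifying
Assumption 1 such that $\nu \le \nu_{\max}$ and $\rho \le \rho_{\max}$ there exists $\kappa$ such
that for all $n$

$$\mathbb{E}[R_n] \le \kappa \cdot \left((\ln^2 n)/n\right)^{1/(d(\nu, \rho)+2)}.$$

Moreover, $\kappa = \alpha \cdot D_{\max} (\nu_{\max}/\nu_\star)^{D_{\max}}$, where $\alpha$ is a
constant independent of $\rho_{\max}$ and $\nu_{\max}$.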

We prove Theorem 1 in Appendices A and B. Notice that Theorem 1 holds for any
$\nu \le \nu_{\max}$ and $\rho \le \rho_{\max}$ and in particular for the parameters
$(\nu_\star, \rho_\star)$ for which $d(\nu, \rho)$ is minimal as long as
$\nu_\star \le \nu_{\max}$ and $\rho_\star \le \rho_{\max}$. In Appendix C, we show how to make
$\rho_{\max}$ and $\nu_{\max}$ optional.


To give some intuition on $D_{\max}$, it is easy to prove that it is the attainable upper bound on
the near-optimality dimension of functions verifying Assumption 1 with $\rho \le \rho_{\max}$.
Moreover, any function on $[0, 1]^p$, Lipschitz for the Euclidean metric, has
$(\ln K)/\ln(1/\rho) = p$ for a standard partitioning.


POO's performance should be compared to the simple regret of HOO run with the best parameters
$\nu_\star$ and $\rho_\star$, which is of order

$$O\left(\left((\ln n)/n\right)^{1/(d(\nu_\star, \rho_\star)+2)}\right).$$

Thus POO's performance is only a factor of $O((\ln n)^{1/(d(\nu_\star, \rho_\star)+2)})$ away from
the optimally fitted HOO. Furthermore, our simple regret bound for POO is slightly better than the
known simple regret bound for StoSOO [15] in the case when $d(\nu, \rho) = 0$ for the same
partitioning, i.e., $\mathbb{E}[R_n] = O(\ln n/\sqrt{n})$. With our algorithm and analysis, we
generalize this bound for any value of $d \ge 0$.














Note that we only give a simple regret bound for POO whereas HOO ensures a bound on both the
cumulative and simple regret.⁶ Notice that since POO runs several HOOs with non-optimal values of
the $(\nu, \rho)$ parameters, this algorithm explores much more than optimally fitted HOO, which
dramatically impacts the cumulative regret. As a consequence, our result applies to the simple
regret only.





⁴ Note that several values of those parameters are possible for the same function.
⁵ Up to a logarithmic term $\sqrt{\ln n}$ in the simple regret.
⁶ In fact, the bound on the simple regret is a direct consequence of the bound on the cumulative regret [3].

Figure 2: Simple regret of POO and HOO run for different values of $\rho$.


4 Experiments 


We ran experiments on the function plotted in Figure 1 for HOO algorithms with different values of
$\rho$ and the POO algorithm for $\rho_{\max} = 0.9$. This function, as described in Section 1, has
an upper and lower envelope that are not of the same order and therefore has $d > 0$ for a standard
partitioning.


In Figure 2, we show the simple regret of the algorithms as a function of the number of
evaluations. In the figure on the left, we plot the simple regret after 500 evaluations. In the
right one, we plot the simple regret after 5000 evaluations in the log-log scale, in order to see
the trend better. The HOO algorithms return a random point chosen uniformly among those evaluated.
POO does the same for the best empirical instance of HOO. We compare the algorithms according to
the expected simple regret, which is the difference between the optimum and the expected function
value at the point they return. We compute it as the average of the value of the function for all
evaluated points.
While we did not investigate possibly different heuristics, we believe that returning the deepest 
evaluated point would give a better empirical performance. 


As expected, the HOO algorithms using values of $\rho$ that are too low do not explore enough and
become quickly stuck in a local optimum. This is the case for both UCT (HOO run for $\rho = 0$) and
HOO run for $\rho = 0.3$. The HOO algorithms using $\rho$ that is too high waste their budget on
exploring too much. This way, we empirically confirmed that the performance of the HOO algorithm is
greatly impacted by the choice of this $\rho$ parameter for the function we considered. In
particular, at $T = 500$, the empirical simple regret of HOO with $\rho = 0.66$ was half of the
simple regret of UCT.


In our experiments, HOO with $\rho = 0.66$ performed the best, which is a bit lower than what the
theory would suggest, since $\rho_\star = 1/\sqrt{2} \approx 0.7$. The performance of HOO using
this parameter is almost matched by POO. This is surprising, considering the fact that POO was
simultaneously running 100 different HOOs. It shows that carefully sharing information between the
instances of HOO, as described and justified in Appendix D, has a major impact on empirical
performance. Indeed, among the 100 HOO instances, only two (on average) actually needed a fresh
function evaluation; the other 98 could reuse the ones performed by another HOO instance.


5 Conclusion 


We introduced POO for global optimization of stochastic functions with unknown smoothness and 
showed that it competes with the best known optimization algorithms that know this smoothness. 
This result extends the previous work of Valko et al. [15], which is only able to deal with a 
near-optimality dimension $d = 0$. POO is provably able to deal with a trove of functions for which $d > 0$ 
for a standard partitioning. Furthermore, we gave new insight into several assumptions required by 
prior work and provided a more natural measure of the complexity of optimizing a function given a 
hierarchical partitioning of the space, without relying on any (semi-)metric. 


A Proof sketch of Theorem 1 


In this part we give the roadmap of the proof. The full proof is in Appendix B. 


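First step For any choice of $\rho_\star$ verifying Assumption 1 and any suboptimal $\rho$ such that 
$0 < \rho_\star < \rho < 1$, we bound the difference of near-optimality dimensions, 

$$d(\rho) - d(\rho_\star) \le \ln K \left(\frac{1}{\ln(1/\rho)} - \frac{1}{\ln(1/\rho_\star)}\right),$$

and deduce that 

$$\min_{i:\,\rho_i \ge \rho_\star} \left[d(\rho_i) - d(\rho_\star)\right] \le \frac{D_{\max}}{N}.$$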


Second step By simultaneously running a large number of HOO instances, we ensure that for all 
$\rho_\star \le \rho_{\max}$, one of them uses a $\rho$ close to $\rho_\star$ and therefore suffers a low regret. On the other hand, 
simultaneously running a large number of HOOs has a cost, as more evaluations need to be done at 
each step, one for each HOO. We optimize this tradeoff to deduce the following good choice of $\delta$, 
which is the maximum distance $|d(\rho_i) - d(\rho_j)|$, where $i$ and $j$ are two consecutive HOOs, 

$$\delta = O\left(1/\ln(t/\ln t)\right).$$


Third step Using the result of the second step, we can compute the simple regret $R_n^{\bar\rho}$ of the HOO 
instance running with the parameter $\bar\rho \ge \rho_\star$ that is the closest to $\rho_\star$. Note that, as POO is running, 
the instance it chooses may change over time and so $\bar\rho$ depends on $n$. 


We prove that there exists a constant $\alpha > 0$ such that for all $n$, $\nu_{\max} > 0$, and $\rho_{\max} < 1$, 

$$R_n^{\bar\rho} \le \alpha \cdot D_{\max} \left(\nu_{\max}/\nu_\star\right)^{D_{\max}} \left((\ln^2 n)/n\right)^{1/(d(\rho_\star)+2)}.$$


Fourth step At the end of the algorithm, we empirically determine which HOO performed the best. 
However, this best empirical instance may not be the instance running with the $\rho$ closest to the optimal 
unknown $\rho_\star$. Nonetheless, we prove that this error is small enough that it only impacts the 
simple regret by a constant factor. 


B Full proof of Theorem 1 


B.1 First step 


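We show that for any choice of $\rho_\star$ verifying Assumption 1 and any $\rho$ such that $0 < \rho_\star < \rho < 1$, 

$$d(\rho) - d(\rho_\star) \le \ln K \left(\frac{1}{\ln(1/\rho)} - \frac{1}{\ln(1/\rho_\star)}\right).$$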


We start by defining $\mathcal{I}_h(\varepsilon)$ as the set of cells of depth $h$ which are $\varepsilon$-near-optimal, 

$$\mathcal{I}_h(\varepsilon) = \left\{ i : \sup_{x \in \mathcal{P}_{h,i}} f(x) \ge f(x^\star) - \varepsilon \right\}.$$


$\mathcal{N}_h(\varepsilon)$, defined in Section 1, is then equal to the cardinality of $\mathcal{I}_h(\varepsilon)$. Notice that if a cell $(h, i)$ is 
$\varepsilon$-near-optimal then all of its ancestors are also $\varepsilon$-near-optimal. Therefore, for any $\varepsilon$ and $h' \ge h$, 
the cells in $\mathcal{I}_{h'}(\varepsilon)$ are descendants of the cells in $\mathcal{I}_h(\varepsilon)$. 


Since the number of descendants at depth $h'$ of a cell at depth $h \le h'$ is bounded by $K^{h'-h}$, we 
bound the cardinality $\mathcal{N}_{h'}(\varepsilon)$ of $\mathcal{I}_{h'}(\varepsilon)$, 

$$\forall \varepsilon,\ \forall h' \ge h, \quad \mathcal{N}_{h'}(\varepsilon) \le K^{h'-h}\, \mathcal{N}_h(\varepsilon).$$


By definition of the near-optimality dimension, we know that for any $\nu > 0$ and $\rho \in (0,1)$, there 
exists $C$ such that for all $h$, 

$$\mathcal{N}_h(2\nu\rho^h) \le C \rho^{-d(\rho) h}.$$

We define $C(\nu, \rho)$ as the smallest $C$ verifying the above condition. 
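
The counting bound $\mathcal{N}_{h'}(\varepsilon) \le K^{h'-h}\mathcal{N}_h(\varepsilon)$ can be checked numerically on a toy example. The sketch below assumes a hypothetical objective $f(x) = -|x - x^\star|$ on $[0,1]$ with the standard $K$-ary partitioning; the function names and the choice $x^\star = 0.3$ are illustrative and not part of the analysis.

```python
def near_optimal_count(h, eps, x_star=0.3, K=2):
    """Number of depth-h cells of the K-ary partition of [0, 1] that are
    eps-near-optimal for the toy function f(x) = -|x - x_star|."""
    count = 0
    width = float(K) ** (-h)
    for i in range(K ** h):
        a, b = i * width, (i + 1) * width
        # sup of f over [a, b] equals minus the distance from x_star to the cell
        dist = max(a - x_star, x_star - b, 0.0)
        if dist <= eps:  # sup f >= f(x_star) - eps, with f(x_star) = 0
            count += 1
    return count


def check_descendant_bound(eps, max_depth=8, K=2):
    """Check N_{h'}(eps) <= K^(h'-h) * N_h(eps) for all 0 <= h <= h' <= max_depth."""
    for h in range(max_depth + 1):
        for h_prime in range(h, max_depth + 1):
            lhs = near_optimal_count(h_prime, eps, K=K)
            rhs = K ** (h_prime - h) * near_optimal_count(h, eps, K=K)
            if lhs > rhs:
                return False
    return True
```

For this toy function the count stays bounded as the depth grows when $\varepsilon$ shrinks geometrically with $h$, which is the behavior a near-optimality dimension of zero describes.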


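For any $0 < \nu_\star \le \nu$, $0 < \rho_\star < \rho < 1$ and any integer $h \ge h_{\min} = \ln(\nu/\nu_\star)/\ln(1/\rho)$, let us define 
$h_\star$ as the greatest integer such that $\nu\rho^h \le \nu_\star\rho_\star^{h_\star}$. From this definition, we get $\nu\rho^h > \nu_\star\rho_\star^{h_\star+1}$, from 
which we deduce that 

$$h_\star > h\,\frac{\ln\rho}{\ln\rho_\star} + \frac{\ln\nu - \ln\nu_\star}{\ln\rho_\star} - 1$$

and then 

$$h - h_\star < h_\star \ln\rho_\star \left(\frac{1}{\ln\rho} - \frac{1}{\ln\rho_\star}\right) + \frac{\ln\rho_\star + \ln\nu_\star - \ln\nu}{\ln\rho}.$$
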
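Since $\mathcal{N}_h(\varepsilon)$ is nondecreasing in $\varepsilon$, $\nu\rho^h \le \nu_\star\rho_\star^{h_\star}$ implies 

$$\mathcal{N}_h(2\nu\rho^h) \le \mathcal{N}_h(2\nu_\star\rho_\star^{h_\star}).$$

We now put everything together to obtain 

$$\begin{aligned}
\mathcal{N}_h(2\nu\rho^h) &\le \mathcal{N}_h(2\nu_\star\rho_\star^{h_\star}) \\
&\le K^{h-h_\star}\, \mathcal{N}_{h_\star}(2\nu_\star\rho_\star^{h_\star}) \\
&\le K^{(\ln\rho_\star+\ln\nu_\star-\ln\nu)/\ln\rho \,+\, h_\star\ln\rho_\star\,(1/\ln\rho - 1/\ln\rho_\star)}\; C(\nu_\star,\rho_\star)\, \rho_\star^{-d(\rho_\star) h_\star} \\
&\le C(\nu_\star,\rho_\star)\, K^{(\ln\rho_\star+\ln\nu_\star-\ln\nu)/\ln\rho}\; \rho_\star^{-h_\star\left[d(\rho_\star) + \ln K\,(1/\ln(1/\rho) - 1/\ln(1/\rho_\star))\right]}.
\end{aligned}$$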


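From $\nu\rho^h \le \nu_\star\rho_\star^{h_\star}$ and $\nu_\star \le \nu$ we get $\rho^{-h} \ge \rho_\star^{-h_\star}$ and therefore 

$$\mathcal{N}_h(2\nu\rho^h) \le C(\nu_\star,\rho_\star)\, K^{(\ln\rho_\star+\ln\nu_\star-\ln\nu)/\ln\rho}\; \rho^{-h\left[d(\rho_\star) + \ln K\,(1/\ln(1/\rho) - 1/\ln(1/\rho_\star))\right]}.$$

We just proved that there exists $\hat{C}$ such that for all $h \ge 0$, 

$$\mathcal{N}_h(2\nu\rho^h) \le \hat{C}\, \rho^{-h\left[d(\rho_\star) + \ln K\,(1/\ln(1/\rho) - 1/\ln(1/\rho_\star))\right]}.$$

By taking 

$$\hat{C} = \max\left(C(\nu_\star,\rho_\star)\, K^{(\ln\rho_\star+\ln\nu_\star-\ln\nu)/\ln\rho},\; K^{h_{\min}}\right),$$

we deduce by the definition of the near-optimality dimension the following bound 

$$d(\rho) \le d(\rho_\star) + \ln K \left(\frac{1}{\ln(1/\rho)} - \frac{1}{\ln(1/\rho_\star)}\right).$$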


We can now deduce that POO should use $\rho_i$ parameters that satisfy 

$$\frac{1}{\ln(1/\rho_i)} = \frac{i}{N} \cdot \frac{1}{\ln(1/\rho_{\max})},$$

where $N$ is the total number of HOO instances run and $i \in \{1, \dots, N\}$. 
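
To make the grid concrete, the sketch below builds parameters $\rho_i = \rho_{\max}^{N/i}$, which is the closed form of $1/\ln(1/\rho_i) = i/(N\ln(1/\rho_{\max}))$, and evaluates the dimension-gap bound of this section for the grid point closest to $\rho_\star$ from above. The function names and the numerical values of $K$, $N$, $\rho_{\max}$ and $\rho_\star$ are assumptions chosen only for illustration.

```python
import math


def rho_grid(rho_max, N):
    """Grid with 1/ln(1/rho_i) = i / (N ln(1/rho_max)), i.e. rho_i = rho_max**(N/i)."""
    return [rho_max ** (N / i) for i in range(1, N + 1)]


def dimension_gap_bound(rho, rho_star, K):
    """Upper bound ln K * (1/ln(1/rho) - 1/ln(1/rho_star)) on d(rho) - d(rho_star)."""
    return math.log(K) * (1 / math.log(1 / rho) - 1 / math.log(1 / rho_star))


# Illustrative values, not taken from the paper's experiments.
K, N, rho_max, rho_star = 2, 8, 0.9, 0.5
grid = rho_grid(rho_max, N)
rho_bar = min(r for r in grid if r >= rho_star)  # closest grid point above rho_star
d_max = math.log(K) / math.log(1 / rho_max)
gap = dimension_gap_bound(rho_bar, rho_star, K)
```

Because consecutive grid points are equally spaced in $1/\ln(1/\rho)$, the gap bound for `rho_bar` cannot exceed $D_{\max}/N$ whenever $\rho_\star \le \rho_{\max}$.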





We now define $\bar\rho$ as the closest $\rho_i$ to $\rho_\star$ used by an existing HOO instance, such that $\rho_i \ge \rho_\star$, 

$$\bar\rho = \arg\min_{\rho_i \ge \rho_\star} \left[d(\rho_i) - d(\rho_\star)\right].$$


Since we assumed that $\rho_\star \le \rho_{\max}$, we know that 

$$d(\bar\rho) - d(\rho_\star) \le \frac{D_{\max}}{N},$$

with $D_{\max} = (\ln K)/\ln(1/\rho_{\max})$. 


B.2 Second step 


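Let us now compute the optimal number $N$ of instances to run in parallel. We bound the logarithm 
of the simple regret $R_t^{\nu,\rho}$ of a single HOO instance using parameters $\nu$ and $\rho$ after this particular 
instance performed $t$ function evaluations. In particular, we bound the simple regret by a linear 
approximation for $\rho \approx \rho_\star$. In the following, $\beta$ is a numerical constant coming from the analysis of 
HOO [4]. For all $t > 0$, we have 

$$\begin{aligned}
\ln R_t^{\nu,\rho} &\le \ln\beta + \frac{\ln C(\nu,\rho)}{2+d(\rho)} - \frac{\ln(t/\ln t)}{2+d(\rho)} \\
&= \ln\beta + \frac{\ln C(\nu,\rho)}{2+d(\rho)} - \frac{\ln(t/\ln t)}{2+d(\rho_\star)} \cdot \frac{1}{1 + (d(\rho)-d(\rho_\star))/(2+d(\rho_\star))} \\
&\le \ln\beta + \frac{\ln C(\nu,\rho)}{2+d(\rho)} - \frac{\ln(t/\ln t)}{2+d(\rho_\star)} \left(1 - \frac{d(\rho)-d(\rho_\star)}{2+d(\rho_\star)}\right),
\end{aligned}$$

where the last step uses $1/(1+x) \ge 1 - x$ for $x \ge 0$. 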


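After $n$ function evaluations by POO, each instance performed at least $t = \lfloor n/N \rfloor$ function 
evaluations. We can now bound the simple regret $R_n^{\mathrm{POO},\bar\rho}$ of the HOO instance using $\nu$ and $\bar\rho$ after $n$ 
evaluations performed by all the instances, 

$$\ln R_n^{\mathrm{POO},\bar\rho} \le \ln\beta + \frac{\ln C(\nu,\bar\rho)}{2+d(\bar\rho)} + \ln\left(\frac{\ln\lfloor n/N\rfloor}{\lfloor n/N\rfloor}\right) \left(\frac{1}{2+d(\rho_\star)} - \frac{D_{\max}/N}{(2+d(\rho_\star))^2}\right). \tag{2}$$

Optimizing this upper bound for $N$ leads to the following choice of $N$, 

$$N \simeq \tfrac{1}{2} D_{\max} \ln(n/\ln n).$$


Therefore, in POO we choose to ensure $N \ge \tfrac{1}{2} D_{\max} \ln(n/\ln n)$. 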


If the time horizon were known in advance, $N$ could be any integer. Nevertheless, since the algorithm 
is anytime, all the previous HOO instances have to be kept and new instances need to be added in 
between. Therefore, we restrict $N$ to be of the form $2^i$, for $i \in \mathbb{N}$. 


As a consequence of this choice, $N$ can be at most 2 times its lower bound and therefore 

$$\tfrac{1}{2} D_{\max} \ln(n/\ln n) \le N \le D_{\max} \ln(n/\ln n).$$
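
The choice of $N$ as a power of two can be sketched as follows. The helper names and the example values ($n = 5000$, $\rho_{\max} = 0.9$, $K = 2$) are hypothetical and are not claimed to match the configuration used in the experiments.

```python
import math


def instance_lower_bound(n, rho_max, K):
    """Lower bound (1/2) * D_max * ln(n / ln n) on the number of instances."""
    if n < 3:
        raise ValueError("n must be at least 3 so that ln(n / ln n) is positive")
    d_max = math.log(K) / math.log(1 / rho_max)
    return 0.5 * d_max * math.log(n / math.log(n))


def num_instances(n, rho_max, K):
    """Smallest power of two that is at least the lower bound."""
    lower = instance_lower_bound(n, rho_max, K)
    N = 1
    while N < lower:
        N *= 2  # doubling keeps all previously started instances on the grid
    return N


example_lower = instance_lower_bound(5000, 0.9, 2)
example_N = num_instances(5000, 0.9, 2)
```

As long as the lower bound is at least 1/2, the returned value lies between the lower bound and twice the lower bound, which is the sandwich stated above.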


B.3 Third step 


Using our choice of $N$, we can bound the simple regret of the HOO instance using $\bar\rho$. We proceed by 
separately bounding each of the terms in Equation 2. 


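$$\begin{aligned}
\frac{\ln C(\nu,\bar\rho)}{2+d(\bar\rho)} &= \frac{1}{2+d(\bar\rho)} \ln C(\nu,\bar\rho) \\
&\le \frac{1}{2+d(\bar\rho)} \ln\max\left(C(\nu_\star,\rho_\star)\, K^{(\ln\rho_\star+\ln\nu_\star-\ln\nu)/\ln\bar\rho},\; K^{h_{\min}}\right) \\
&\le \frac{1}{2+d(\bar\rho)} \max\left(\ln C(\nu_\star,\rho_\star) + \ln K\, \frac{\ln(1/\rho_\star)+\ln(\nu/\nu_\star)}{\ln(1/\bar\rho)},\; \ln K\, \frac{\ln(\nu/\nu_\star)}{\ln(1/\bar\rho)}\right) \\
&\le \frac{1}{2+d(\bar\rho)} \left(\ln C(\nu_\star,\rho_\star) + \ln K \ln(1/\rho_\star) \left(\frac{1}{\ln(1/\rho_\star)} + \frac{D_{\max}}{N \ln K}\right) + D_{\max} \ln(\nu_{\max}/\nu_\star)\right) \\
&\le \gamma + \frac{D_{\max}}{2+d(\bar\rho)} \ln(\nu_{\max}/\nu_\star).
\end{aligned}$$


In the last expression, $\gamma$ is a quantity independent of $\nu_{\max}$, $\rho_{\max}$, and $N$. 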


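We now use $N \le D_{\max} \ln(n/\ln n)$ to get 

$$\ln\left(\frac{\ln\lfloor n/N\rfloor}{\lfloor n/N\rfloor}\right) \le \ln\left(\frac{D_{\max}\, (\ln n) \ln(n/\ln n)}{n}\right).$$


To bound the last term, we use $\tfrac{1}{2} D_{\max} \ln(n/\ln n) \le N$ to get 

$$-\ln\left(\frac{\ln\lfloor n/N\rfloor}{\lfloor n/N\rfloor}\right) \frac{D_{\max}/N}{(2+d(\rho_\star))^2} \le \ln\left(\frac{n}{\ln n}\right) \frac{1}{2\ln(n/\ln n)} \le 2.$$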


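We can finally bound the simple regret $R_n^{\mathrm{POO},\bar\rho}$ of the HOO instance using $\bar\rho$ after $n$ function evaluations 
overall. Combining the results above, we know that for all $n$, $\nu_{\max}$, and $\rho_{\max}$, 

$$R_n^{\mathrm{POO},\bar\rho} \le \beta \exp(\gamma + 2) \left(D_{\max} \left(\nu_{\max}/\nu_\star\right)^{D_{\max}} (\ln n) \ln(n/\ln n)/n\right)^{1/(2+d(\rho_\star))}.$$


We bound $\ln(n/\ln n)$ by $\ln n$ to get the following bound. There exists $\alpha$ that is independent of $\rho_{\max}$ 
and $\nu_{\max}$, such that 

$$R_n^{\mathrm{POO},\bar\rho} \le \alpha \cdot D_{\max} \left(\nu_{\max}/\nu_\star\right)^{D_{\max}} \left((\ln^2 n)/n\right)^{1/(2+d(\rho_\star))}.$$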


B.4 Fourth step 


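Let $(X_{i,j})_{i \le n,\, j \le N}$ be a family of points in $\mathcal{X}$ evaluated by POO. We denote by $\tilde{f}(X_{i,j})$ the noisy 
evaluation at $X_{i,j}$ and $f(X_{i,j}) = \mathbb{E}[\tilde{f}(X_{i,j})]$. We also define: 

$$\hat\mu_j = \frac{1}{n} \sum_{i=1}^{n} \tilde{f}(X_{i,j}), \qquad \mu_j = \frac{1}{n} \sum_{i=1}^{n} f(X_{i,j}),$$

$$\hat\jmath = \arg\max_{1 \le j \le N} \hat\mu_j, \qquad j^\star = \arg\max_{1 \le j \le N} \mu_j.$$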


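By the Hoeffding-Azuma inequality for martingale differences, for any $\Delta > 0$, 

$$P\left[\sum_{i=1}^{n} \left(\tilde{f}(X_{i,j}) - f(X_{i,j})\right) > n\Delta\right] \le \exp(-2n\Delta^2).$$


Therefore 

$$P\left[\hat\mu_j - \mu_j > \Delta\right] \le \exp(-2n\Delta^2).$$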


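As we have 

$$\forall x > 0, \quad x \exp(-2nx^2) \le \frac{e^{-1/2}}{2\sqrt{n}},$$

we can now integrate $\exp(-2n\Delta^2)$ over $\Delta \in [0, 1]$ to get 

$$\mathbb{E}\left[\hat\mu_j - \mu_j\right] \le \frac{e^{-1/2}}{2\sqrt{n}}.$$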
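
The elementary inequality $x\exp(-2nx^2) \le e^{-1/2}/(2\sqrt{n})$ invoked in this step follows from a one-line maximization, spelled out here for completeness (the name $g$ is introduced only for this check): 

$$g(x) = x\, e^{-2nx^2}, \qquad g'(x) = \left(1 - 4nx^2\right) e^{-2nx^2}, \qquad g'(x) = 0 \iff x = \frac{1}{2\sqrt{n}},$$

$$\max_{x > 0} g(x) = g\!\left(\frac{1}{2\sqrt{n}}\right) = \frac{1}{2\sqrt{n}}\, e^{-2n \cdot \frac{1}{4n}} = \frac{e^{-1/2}}{2\sqrt{n}}.$$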


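Now consider 

$$\mathbb{E}\left[\mu_{j^\star} - \mu_{\hat\jmath}\right] = \mathbb{E}\left[\mu_{j^\star} - \hat\mu_{j^\star}\right] + \mathbb{E}\left[\hat\mu_{j^\star} - \hat\mu_{\hat\jmath}\right] + \mathbb{E}\left[\hat\mu_{\hat\jmath} - \mu_{\hat\jmath}\right].$$


Notice that the first and last terms are both bounded by $e^{-1/2}/(2\sqrt{n})$ and the middle term is nonpositive, 
since $\hat\jmath$ maximizes $\hat\mu_j$. Finally, taking a union bound over the $N$ variables $\mu_j$ we get 

$$\mathbb{E}\left[\mu_{j^\star} - \mu_{\hat\jmath}\right] \le \frac{e^{-1/2} N}{\sqrt{n}}.$$


As $N = O(\ln n)$, we conclude that this additional term is negligible with respect to 

$$\left((\ln n) \ln(n/\ln n)/n\right)^{1/(2+d(\rho_\star))}.$$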




C Increasing sequence for $\rho_{\max}$ and $\nu_{\max}$ 


Besides the number $K$ of children for each cell, POO needs two parameters, $\rho_{\max} \in (0,1)$ and 
$\nu_{\max} > 0$. Theorem 1 states that POO run with those parameters performs almost as well as the 
best instance of HOO run with $\nu \le \nu_{\max}$ and $\rho \le \rho_{\max}$, i.e., corresponding to the near-optimality 
dimension $\min\{d(\nu, \rho) : \nu \le \nu_{\max},\ \rho \le \rho_{\max}\}$. 


Therefore, the larger the values $\rho_{\max}$ and $\nu_{\max}$ used by POO, the wider the set of HOO instances that 
we can compete with. Nevertheless, large values of $\rho_{\max}$ and $\nu_{\max}$ impact the performance by a 
multiplicative constant of order $D_{\max}\, \nu_{\max}^{D_{\max}}$. This tradeoff between performance and size of our 
comparison class is unfortunate but unavoidable. 


In practice, as we strive for an algorithm that does not require the knowledge of the smoothness, 
we may increase the values of $\rho_{\max}(n)$ and $\nu_{\max}(n)$ with the number of evaluations $n$, so that 
the class of functions covered by POO gets bigger with the numerical budget. Nevertheless, the 
increase should be slow enough so that we do not compromise the performance. In particular, we will 
require that $\nu_{\max}(n)^{D_{\max}(n)}$ does not increase too fast. In fact, any sequence $\rho_{\max}(n)$ converging 
to 1 and $\nu_{\max}(n)$ diverging to infinity impacts the simple regret by an additive term, which is the 
smallest time $n$ such that $\rho_\star \le \rho_{\max}(n)$ and $\nu_\star \le \nu_{\max}(n)$, i.e., the first time the assumptions are 
verified. A slowly increasing sequence means a smaller impact on the simple regret rate but a higher 
additive term (a constant independent of $n$). Any sensible choice of increasing sequences $\rho_{\max}(n)$ 
and $\nu_{\max}(n)$, impacting the rate by only a subpolynomial factor, is a valid choice. 


Algorithm 1 is described using constant $\rho_{\max}$ and $\nu_{\max}$ for clarity, but its implementation is easily 
modified to deal with increasing values of these two parameters while preserving the anytime 
property of the algorithm, as follows. At any time, all the HOO instances must use the same $\nu_{\max}$ 
parameter. On the other hand, considering $\rho_{\max}$, the value of $D_{\max}$ has to be increased such that the already 
running HOO instances stay relevant. One way to do that is to increase $D_{\max}$ to $D_{\max}(N+1)/N$ and 
run an additional HOO instance. An alternative solution is to perform, each time when needed, the 
following increment $\rho_{\max} \leftarrow \sqrt{\rho_{\max}}$ and run $N$ additional HOO instances with parameters $\rho_{\max}^{2N/(N+i)}$ 
for $i \in \{1, \dots, N\}$. 
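
The square-root increment can be illustrated with a short sketch. It assumes that the instances sit on the grid $\rho_i = \rho_{\max}^{N/i}$ from Appendix B.1 and that the $N$ added instances use the updated $\rho_{\max}$ raised to the power $2N/(N+i)$; both the function names and this exact form of the added parameters are assumptions made for illustration.

```python
import math


def rho_grid(rho_max, N):
    """Grid rho_i = rho_max**(N/i), i = 1..N (assumed form of the grid)."""
    return [rho_max ** (N / i) for i in range(1, N + 1)]


def extend_grid(rho_max, N):
    """Apply rho_max <- sqrt(rho_max) and return the new rho_max together
    with the N additional parameters that complete the grid of size 2N."""
    new_rho_max = math.sqrt(rho_max)
    extra = [new_rho_max ** (2 * N / (N + i)) for i in range(1, N + 1)]
    return new_rho_max, extra


# Illustrative values: 4 running instances with rho_max = 0.81.
old_rho_max, old_N = 0.81, 4
old_grid = rho_grid(old_rho_max, old_N)
new_rho_max, extra = extend_grid(old_rho_max, old_N)
combined = old_grid + extra
```

Under these assumptions the already running instances are exactly the first half of the grid built from the new $\rho_{\max}$ with $2N$ instances, so nothing has to be restarted.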


D Information sharing among parallel runs 


Since we run several instances of HOO on the same partitioning of $\mathcal{X}$, we may think of sharing the 
samples among them, in order to decrease the estimation error. However, this needs to be done 
carefully in order to avoid adding unwanted bias in the estimation of the $U$ values in the HOO instances. 
Ideally, each HOO instance would reuse all function evaluations acquired by all other instances. 
Unfortunately, this solution would not easily come with theoretical guarantees, as it would 
artificially shrink the confidence intervals at some cells and introduce search bias. 


Instead, whenever a HOO instance requires a function evaluation, we perform a look-up to find out 
whether another HOO instance has already evaluated $f$ at this point. If so, instead of 
evaluating the function at this point again, we simply reuse the sample. This way, HOO instances 
are not given access to samples they never asked for. However, the empirical simple regrets of the 
HOOs become correlated with each other. This is not a problem because in Appendix B.4 we do not assume 
independence between the empirical means of the HOOs, only independence of the rewards within each 
instance, which still holds. Therefore, with this modification, our theoretical guarantees continue to 
apply. Note that if all the instances share all their rewards, then they are all equivalent and 
no mistake is possible. One can then show that the worst case is when no rewards are shared and that the 
error due to choosing the wrong instance actually decreases when information is shared. 
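
One possible way to implement this look-up is sketched below. The class name `SharedSampler`, the keying of samples by cell, and in particular the rule that an instance's $k$-th request at a cell receives the $k$-th stored sample (so that an instance never sees the same reward twice) are assumptions of this sketch; the text only specifies that a sample is reused when another instance has already evaluated the point.

```python
from collections import defaultdict


class SharedSampler:
    """Shares noisy evaluations among parallel instances without handing an
    instance a sample it did not ask for."""

    def __init__(self, noisy_f):
        self.noisy_f = noisy_f
        self.samples = defaultdict(list)  # cell -> rewards drawn so far
        self.used = defaultdict(int)      # (instance, cell) -> rewards consumed
        self.fresh_evaluations = 0

    def evaluate(self, instance_id, cell, x):
        """Return the next reward for `instance_id` at `cell`, evaluating the
        function at x only if no unused stored sample is available."""
        k = self.used[(instance_id, cell)]
        if k == len(self.samples[cell]):
            # No stored sample left for this instance: pay for a fresh one.
            self.samples[cell].append(self.noisy_f(x))
            self.fresh_evaluations += 1
        self.used[(instance_id, cell)] = k + 1
        return self.samples[cell][k]
```

With this rule the rewards seen by a single instance at a given cell are distinct draws, which is the within-instance property the argument above relies on, while `fresh_evaluations` counts the budget actually spent.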


Finally, we want to stress that sharing information is extremely important in practice, as our 
experiments reveal. Since the number of HOO instances can be very large,⁵ one could expect the 
performance of POO to be poor. However, as the vast majority of the function evaluations are in 
practice shared, POO performs almost as well as HOO fitted with the best parameters. Summing up, 
although the performance bound on the simple regret with this modification is the same, empirical 
performance improves tremendously. 


⁵ Even though it scales only as $\ln n$ with the number of evaluations $n$, it does not scale well with $\rho_{\max}$. 


E Zero-quality functions 


For any $\rho \in (0,1)$, we construct a locally Lipschitz function with rate $\rho$ and constant $\nu = 1$ that 
POO can provably optimize and whose quality, as defined by Definition 2, is zero. In order to properly 
define the quality, we use the uniform distribution on $[0, 1]$ to sample from a node of the partitioning. 


Definition 2 (Slivkins [14]). The quality is the largest $q \in (0,1)$ such that for each subtree $v$ 
containing the optimum, there exist nodes $u$ and $u'$ such that $P(u|v)$ and $P(u'|v)$ are at least $q$ and 


$$|f(u) - f(u')| \ge \frac{1}{2} \sup_{x, y \in v} |f(x) - f(y)|.$$


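We construct such a function $f$ on the interval $[0,1]$, its maximum being attained at $x^\star = 0$ with 
$f(0) = 0$. For any $x \ne 0$ we define $f$ as follows. For any $h \ge 0$ we define $f$ on $\left(2^{-(h+1)}, 2^{-h}\right]$ as 

$$\forall x \in \left(\frac{1}{2^{h+1}},\ \frac{1 + 1/(h+1)}{2^{h+1}}\right], \quad f(x) = -\rho^h,$$

$$\forall x \in \left(\frac{1 + 1/(h+1)}{2^{h+1}},\ \frac{1}{2^h}\right], \quad f(x) = -\frac{\rho^h}{3}.$$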


We also consider the standard partitioning on $[0, 1]$. 


The optimal node of depth $h$ corresponds to the interval $[0, 2^{-h}]$. By our definition of $f$, 

$$\forall x \in [0, 2^{-h}], \quad f(0) - f(x) \le \rho^h$$
#0) a 2) = 


from which we conclude that $f$ is locally Lipschitz with rate $\rho$ and therefore can be optimized by 
POO with provable finite-time guarantees (Theorem 1). 


Now we prove that the quality of this function is zero. Using Definition 2, we can do it by showing 
that there exists no $q \in (0,1)$ for which every node $v$ along the optimal path contains nodes $u$ 
and $u'$ verifying $P(u|v) \ge q$ (and the same for $u'$) such that 


ieee 


reu xeu! ZEU 2 


(3) 


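Let $q$ be a real number from $(0, 1)$ and consider any $h > 1/q$. We pick $v = [0, 2^{-h}]$. Then 

$$P\left(\{x \in v : f(x) < -\rho^h/2\} \,\middle|\, v\right) \le \sum_{k=0}^{\infty} \frac{1}{2^{k+1}} \cdot \frac{1}{h+k+1} \le \frac{1}{h+1} < q.$$
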
Notice that if $u'$ verifies (3), then $u'$ is included in $\{x \in v : f(x) < -\rho^h/2\}$. Combined with the 
equation above, we have that 

$$P(u'|v) \le P\left(\{x \in v : f(x) < -\rho^h/2\} \,\middle|\, v\right) < q,$$

which is a contradiction. Since this holds for any $q > 0$, we deduce that the quality of $f$ is zero. 
Yet $f$ is Lipschitz with rate $\rho \in (0,1)$ and therefore $f$ can be optimized by POO. 
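
The two properties used in this appendix can be checked numerically on a sketch of the construction. The code assumes the piecewise-constant reading of the definition in which, on each band $(2^{-(h+1)}, 2^{-h}]$, $f$ equals $-\rho^h$ on the first fraction $1/(h+1)$ of the band and $-\rho^h/3$ on the rest; these two values, the function names, and the truncation depth are assumptions of the sketch.

```python
def f(x, rho):
    """Assumed zero-quality function on [0, 1] with maximum f(0) = 0."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    if x == 0.0:
        return 0.0
    # Find the band (2^-(h+1), 2^-h] that contains x.
    h = 0
    while x <= 2.0 ** -(h + 1):
        h += 1
    split = (1 + 1 / (h + 1)) * 2.0 ** -(h + 1)
    return -rho ** h if x <= split else -rho ** h / 3


def low_value_fraction(h, rho, depth=60):
    """Fraction of v = [0, 2^-h] (uniform measure) on which f < -rho^h / 2,
    summed band by band and truncated after `depth` bands."""
    total = 0.0
    for k in range(depth):
        # Only the deep part of band h+k (value -rho^(h+k)) can fall below
        # -rho^h / 2; the shallow part (-rho^(h+k) / 3) never does.
        if rho ** (h + k) > rho ** h / 2:
            total += 2.0 ** -(k + 1) / (h + k + 1)
    return total
```

For every depth $h$ the returned fraction is at most $1/(h+1)$, so for any fixed $q$ a deep enough node $v$ leaves no room for a subset of conditional probability $q$ on which $f$ is below half the width of $v$.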


$\square$